Skip to content

fix: verify that errors are propagated on engine hooks for correct subgraphs#2399

Merged
SkArchon merged 8 commits intomainfrom
milinda/eng-8637-error-message-attached-to-multiple-unrelated-subgraphs
Dec 10, 2025
Merged

fix: verify that errors are propagated on engine hooks for correct subgraphs#2399
SkArchon merged 8 commits intomainfrom
milinda/eng-8637-error-message-attached-to-multiple-unrelated-subgraphs

Conversation

@SkArchon
Copy link
Copy Markdown
Contributor

@SkArchon SkArchon commented Dec 9, 2025

This PR merges the changes from the engine where we pass only the relevant errors to the engine hooks OnFinished function.

Depends on wundergraph/graphql-go-tools#1351

Summary by CodeRabbit

  • Tests

    • Added telemetry tests to verify failing subgraph connections produce exception events on the correct engine-fetch spans (single and multiple failure scenarios), with helpers to simulate connection failures and capture spans.
  • Refactor

    • Simplified GraphQL request and subscription context initialization to a constructor-driven flow while preserving behavior.
  • Chores

    • Updated module dependency versions.

✏️ Tip: You can customize this high-level summary in your review settings.

Checklist

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 9, 2025

Walkthrough

Adds telemetry tests simulating subgraph connection failures to assert exception events on Engine - Fetch spans; refactors GraphQL and WebSocket codepaths to construct resolve.Context via resolve.NewContext(...) with explicit field assignments; upgrades github.com/wundergraph/graphql-go-tools/v2 in two go.mod files.

Changes

Cohort / File(s) Summary
Telemetry error span tests
router-tests/telemetry/telemetry_test.go
Added TestFlakyTelemetry with subtests for single and multiple subgraph failures; introduced simulateConnectionFailureOnClose helper; tests use an in-memory OTel exporter to assert exception events and otel.subgraph.name attribution on Engine - Fetch spans.
GraphQL handler context construction
router/core/graphql_handler.go
Replaced composite-literal construction + WithContext with resolve.NewContext(executionContext) and explicit field assignments (Variables, RemapVariables, Files, Request, RenameTypeNames, TracingOptions, InitialPayload, Extensions, ExecutionOptions). Behavior preserved.
WebSocket subscription context construction
router/core/websocket.go
Replaced inline resolve.Context literal in executeSubscription with withRequestContext(...)+resolve.NewContext(...) and explicit assignment of Variables, Request headers/ID, RenameTypeNames, RemapVariables, TracingOptions, Extensions, and optional InitialPayload. Flow preserved.
Module bumps
router/go.mod, router-tests/go.mod
Updated required version of github.com/wundergraph/graphql-go-tools/v2 from v2.0.0-rc.240 to v2.0.0-rc.241 in both go.mod files.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review OTel test assertions for correct span/event attachment and exact attribute keys (otel.subgraph.name) and ensure tests are not flaky.
  • Verify resolve.NewContext preserves context propagation, cancellation, and any side effects compared to the previous literal + WithContext pattern.
  • Confirm WebSocket subscription behavior (initial payload forwarding, headers, request ID) is unchanged after refactor.

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding tests to verify that errors from subgraph failures are correctly attached to Engine-Fetch spans for the correct subgraphs, and refactoring the resolve context initialization to support this verification.
✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Dec 9, 2025

Router image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-0421c0f7a0e656904bfd523ebb24a32d28d92745

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
router-tests/telemetry/telemetry_test.go (1)

11257-11387: Engine‑Fetch error attribution tests look correct; minor naming/DRY nits

The new helper and subtests correctly enforce that only Engine - Fetch spans for failing subgraphs carry an exception event, which is exactly what you want to guard the new engine‑hook behavior.

Two small, non‑blocking suggestions:

  • The outer description "verify errors being attached to unrelated span subgraphs" reads opposite to what the tests assert (they ensure errors are not attached to unrelated subgraphs). Renaming it to something like "verify errors are only attached to failing subgraph spans" would avoid confusion.
  • The span‑inspection loop (filtering Engine - Fetch, extracting wg.subgraph.name, checking for a single exception event) is duplicated between the two subtests; a tiny local helper taking *testing.T, spans, and the expected error subgraph set would keep the expectation in one place and ease future changes.

Otherwise this block looks solid and aligned with the existing telemetry tests’ style.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 24c0648 and 9504997.

📒 Files selected for processing (1)
  • router-tests/telemetry/telemetry_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: JivusAyrus
Repo: wundergraph/cosmo PR: 2156
File: controlplane/src/core/repositories/SubgraphRepository.ts:1746-1763
Timestamp: 2025-09-08T20:57:07.946Z
Learning: The checkSubgraphSchema.ts file already correctly implements linked subgraph functionality, using byName(linkedSubgraph.name, linkedSubgraph.namespace) to fetch target subgraphs and properly handles parse(newSchemaSDL) for schema building. The implementation doesn't need fixes for byId usage or schema parsing as it's already correct.
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
🧬 Code graph analysis (1)
router-tests/telemetry/telemetry_test.go (2)
router-tests/testenv/testenv.go (5)
  • Run (105-122)
  • Config (292-349)
  • SubgraphsConfig (374-387)
  • SubgraphConfig (389-393)
  • Environment (1761-1797)
router/pkg/otel/attributes.go (1)
  • WgSubgraphName (24-24)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Analyze (go)
  • GitHub Check: build_test
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./events)
  • GitHub Check: Analyze (javascript-typescript)

@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 55.59%. Comparing base (cbbd5d9) to head (e7936a2).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2399      +/-   ##
==========================================
+ Coverage   46.62%   55.59%   +8.97%     
==========================================
  Files         340      213     -127     
  Lines       34302    23240   -11062     
  Branches      251        0     -251     
==========================================
- Hits        15992    12921    -3071     
+ Misses      17055     9072    -7983     
+ Partials     1255     1247       -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions github-actions Bot added the router label Dec 9, 2025
@SkArchon SkArchon force-pushed the milinda/eng-8637-error-message-attached-to-multiple-unrelated-subgraphs branch from 60cd23f to eaf7190 Compare December 10, 2025 09:38
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
router-tests/telemetry/telemetry_test.go (1)

11257-11387: Engine‑Fetch span attribution checks look good; optionally assert expected spans exist

The new tests correctly ensure that only the failing subgraphs’ Engine - Fetch spans carry a single exception event, and that non‑failing subgraphs’ spans remain event‑free. That directly guards against the regression where errors are attached to unrelated subgraph spans.

One optional hardening: currently the loops will pass even if no Engine - Fetch span is emitted for a failing subgraph (e.g., if that span creation regresses entirely). You could track whether you actually saw the expected subgraph spans and assert at the end:

-       subgraphThatShouldHaveError := "products"
+       subgraphThatShouldHaveError := "products"
+       seenProducts := false
...
-                       if subgraphName == subgraphThatShouldHaveError {
-                            require.True(t, hasErrorEvent)
-                       } else {
-                            require.False(t, hasErrorEvent)
-                       }
+                       if subgraphName == subgraphThatShouldHaveError {
+                            seenProducts = true
+                            require.True(t, hasErrorEvent)
+                       } else {
+                            require.False(t, hasErrorEvent)
+                       }
...
+       require.True(t, seenProducts, "expected an Engine - Fetch span for products")

and similarly track seenProducts / seenAvailability in the multi‑error case. This keeps the tests robust if span emission behavior ever changes, without altering the core intent.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef35bbb and e7936a2.

⛔ Files ignored due to path filters (2)
  • router-tests/go.sum is excluded by !**/*.sum
  • router/go.sum is excluded by !**/*.sum
📒 Files selected for processing (3)
  • router-tests/go.mod (1 hunks)
  • router-tests/telemetry/telemetry_test.go (1 hunks)
  • router/go.mod (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/go.mod
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2273
File: router/core/graphql_handler.go:0-0
Timestamp: 2025-11-19T15:13:57.821Z
Learning: In the Cosmo router (wundergraph/cosmo), error handling follows a two-phase pattern: (1) Prehandler phase handles request parsing, validation, and setup errors using `httpGraphqlError` and `writeOperationError` (in files like graphql_prehandler.go, operation_processor.go, parse_multipart.go, batch.go); (2) Execution phase handles resolver execution errors using `WriteError` in GraphQLHandler.ServeHTTP. Because all `httpGraphqlError` instances are caught in the prehandler before ServeHTTP is invoked, any error type checks for `httpGraphqlError` in the execution-phase WriteError method are unreachable code.
Learnt from: JivusAyrus
Repo: wundergraph/cosmo PR: 2156
File: controlplane/src/core/repositories/SubgraphRepository.ts:1746-1763
Timestamp: 2025-09-08T20:57:07.946Z
Learning: The checkSubgraphSchema.ts file already correctly implements linked subgraph functionality, using byName(linkedSubgraph.name, linkedSubgraph.namespace) to fetch target subgraphs and properly handles parse(newSchemaSDL) for schema building. The implementation doesn't need fixes for byId usage or schema parsing as it's already correct.
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/go.mod
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router-tests/go.mod
  • router-tests/telemetry/telemetry_test.go
📚 Learning: 2025-09-24T12:54:00.765Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2222
File: router-tests/websocket_test.go:2238-2302
Timestamp: 2025-09-24T12:54:00.765Z
Learning: The wundergraph/cosmo project uses Go 1.25 (Go 1.23+ minimum), so fmt.Appendf and other newer Go standard library functions are available and can be used without compatibility concerns.

Applied to files:

  • router-tests/go.mod
📚 Learning: 2025-09-24T12:54:00.765Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2222
File: router-tests/websocket_test.go:2238-2302
Timestamp: 2025-09-24T12:54:00.765Z
Learning: The wundergraph/cosmo project uses Go 1.25 (Go 1.25 minimum), so fmt.Appendf and other newer Go standard library functions are available and can be used without compatibility concerns.

Applied to files:

  • router-tests/go.mod
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router-tests/go.mod
  • router-tests/telemetry/telemetry_test.go
📚 Learning: 2025-08-15T10:21:45.838Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2142
File: helm/cosmo/Chart.yaml:0-0
Timestamp: 2025-08-15T10:21:45.838Z
Learning: In the WunderGraph Cosmo project, helm chart version upgrades and README badge synchronization are handled in separate helm release PRs, not in the initial version bump PRs.

Applied to files:

  • router-tests/go.mod
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
🧬 Code graph analysis (1)
router-tests/telemetry/telemetry_test.go (4)
router-tests/testenv/testenv.go (6)
  • Run (105-122)
  • Config (292-349)
  • SubgraphsConfig (374-387)
  • SubgraphConfig (389-393)
  • Environment (1735-1771)
  • GraphQLRequest (1911-1919)
router/pkg/trace/tracetest/tracetest.go (1)
  • NewInMemoryExporter (11-17)
router/core/operation_processor.go (1)
  • GraphQLRequest (194-199)
router/pkg/otel/attributes.go (1)
  • WgSubgraphName (24-24)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router-tests/go.mod (1)

29-29: Upgrade graphql-go-tools dependency to align with upstream engine fix.

The version bump from v2.0.0-rc.240 to v2.0.0-rc.241 appears intentional and coordinated with code changes in router/core/graphql_handler.go and router/core/websocket.go (which switch to resolve.NewContext pattern). This aligns with the PR objective to merge engine changes for error propagation fixes in hooks.

Please confirm that v2.0.0-rc.241 is the correct version per the upstream graphql-go-tools PR #1351 merge status, and verify there are no breaking changes introduced by this RC version bump when integrated with the code refactors.

@SkArchon SkArchon merged commit a5ca15d into main Dec 10, 2025
59 of 62 checks passed
@SkArchon SkArchon deleted the milinda/eng-8637-error-message-attached-to-multiple-unrelated-subgraphs branch December 10, 2025 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants